ABSTRACT


An advanced development program, conducted at the Jet Propulsion
Laboratory to develop the technology for autonomous operation of planetary
spacecraft power systems, initiated a study to develop a methodology for
selecting an optimum microcomputer architecture.

Unlike most applications of microcomputers, performance
requirements such as throughput and data handling capacity are not as
significant to autonomous operation of a spacecraft power system as they are
to more common applications such as signal processing and data manipulation.
Planetary spacecraft power systems are, however, complex in terms of the
number of different functions performed. They are also in a unique class on
which the total mission depends; therefore, reliability and fault tolerance
are primary requirements.

Various microcomputer system architectures are analyzed to 
determine their application to spacecraft power systems. 

Of the many microcomputer system architectures analyzed and 
discussed, no dominant system topology, applicable to automating spacecraft 
power systems, emerged. Indeed, there exists no standardized formula or 
common set of guidelines which will provide an optimum configuration for a 
given set of specifications. 

Future work is shown to be necessary to develop performance and
reliability models of alternate microcomputer architectures as a methodology
for optimizing system design.
SECTION 1


INTRODUCTION 

Onboard computational capability has increased dramatically over
the past 10 years to satisfy more demanding mission requirements.
Ultra-reliable computer architectures are necessary for data acquisition
and real-time control. The purpose of this study is to investigate
microcomputer system architectures with particular application to developing a
design methodology for spacecraft power systems. Initial study effort has
focused on two issues:

(1) Examination of several microcomputer architectures which may 
be suitable for spacecraft power system monitoring and 
control. 

(2) Investigation of currently available redundancy/fault 
tolerance techniques. 

Relatively speaking, it is easy to design a computer system, but it
is very difficult to design a system that is optimized for a given set of
requirements. In other words, techniques for logic design and system
programming are fairly well understood, and there are also a number of
techniques for analyzing the performance of a computer system. But designing
a system that will perform well in a specific application area is a very
intuitive undertaking, and a good deal of experimentation is usually involved.
This report examines possible design approaches and the trade-offs involved,
and points out factors which affect the choice of computer system architecture
for a spacecraft power system.


SECTION 2


POWER SYSTEM PERFORMANCE EVALUATION AND CONTROL REQUIREMENTS 

To establish a basis for the performance requirements of a power
system computer network, typical power system functions and estimated timing
requirements were reviewed and are summarized in this section.

Current spacecraft power systems typically perform the following
functions:

(1) Load switching 

(2) Power processing 

(3) Fault detection and correction 

(4) Battery charging 

(5) Battery reconditioning 

The implementation and priority of the above functions may change subject to
mission requirements and type of energy source.

Table 2-1 is a list of functions which have been identified
(Ref. 1) as candidates for autonomous control on future planetary power
systems.

The magnitude of the number of measurements and control commands
necessary for autonomous monitoring and control of a planetary spacecraft
power system is shown in Table 2-2.
Table 2-1. Typical Power System Functions and Their
Computational Requirements


     Power System Functions                        Estimated Time Requirement    Computational Assessment

 1.  Fault Detection and Correction                10-ms response                Simple logical processing
 2.  Command Processing                            1-ms decode time              Moderate logical processing
 3.  Relay Status Monitoring                       As required                   Simple logical processing
 4.  Relay Control                                 10-ms                         Simple logical processing
 5.  Data Acquisition, Processing and Storage      All parameters every 100 ms   Moderate logical processing
 6.  System Monitoring and Diagnosis               1 s                           Moderate logical processing
 7.  Subassembly Monitoring and Diagnosis          As required                   Moderate logical processing
 8.  Load Sequencing and Control                   100-ms response               Moderate logical processing
 9.  Load Equipment Monitoring and Diagnosis       100-ms response               Moderate mathematical processing
10.  Power Capability and Margin Management        1 to 10 s                     Complex mathematical processing


Table 2-2. Estimated Power System Commands and Measurements
for Autonomous Operation


Subassembly                  Analog Measurements    Relay Commands

Battery Electronics                    8                  14
Solar-Array Electronics               12                   4
Power Control                         13                   6
Battery Charger 1                      1                   5
Battery Charger 2                      1                   5
Boost Regulator 1                      2                   0
30-Vdc Converter                       2                   4
Power Distribution                    34                  42
Battery                               30                  56

Total                                103                 136


Although this table summarizes a particular spacecraft power system design
incorporating autonomous functions, it can be seen that hundreds of
measurements and control commands are typically necessary (Refs. 2,3,4).
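
As a rough check on the size of the monitoring task, the counts in Table 2-2
can be totaled and combined with the 100-ms acquisition interval listed in
Table 2-1. The sketch below is illustrative only; it assumes that every
analog channel is sampled once per 100-ms scan, which the tables suggest but
do not state outright.

```python
# Per-subassembly counts copied from Table 2-2:
# name -> (analog measurements, relay commands)
TABLE_2_2 = {
    "Battery Electronics": (8, 14),
    "Solar-Array Electronics": (12, 4),
    "Power Control": (13, 6),
    "Battery Charger 1": (1, 5),
    "Battery Charger 2": (1, 5),
    "Boost Regulator 1": (2, 0),
    "30-Vdc Converter": (2, 4),
    "Power Distribution": (34, 42),
    "Battery": (30, 56),
}

# Assumption: all analog parameters are scanned once every 100 ms
# (Table 2-1, data acquisition entry).
SCAN_PERIOD_MS = 100


def totals(table):
    """Return (total analog measurements, total relay commands)."""
    analog = sum(a for a, _ in table.values())
    relay = sum(r for _, r in table.values())
    return analog, relay


analog_total, relay_total = totals(TABLE_2_2)
samples_per_second = analog_total * 1000 / SCAN_PERIOD_MS

print("analog measurements:", analog_total)
print("relay commands:", relay_total)
print("analog samples per second:", samples_per_second)
```

Under that assumption the acquisition load is on the order of a thousand
samples per second, which is modest for a single microcomputer.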





SECTION 3 


FACTORS AFFECTING SELECTION OF COMPUTER ARCHITECTURE 

The data processing requirements (speed, timing, and computational
complexity) for a spacecraft power system are not as demanding as those for
some other subsystems (e.g., Attitude Control). However, the criticality of
the power system to mission success dictates that the data processing and
control functions be highly reliable. Some of the key factors which affect
the selection of architecture are discussed below.


3.1 TYPE OF COMPUTER 

Unless there are clear indications that a particular microcomputer
is required (e.g., by throughput requirements), designers usually select one
with which they are familiar and/or one whose development system is available.


3.2 PERFORMANCE 

The first decision to be made is whether one computer can meet the
throughput requirements demanded by the system. Designers find it difficult 
to generalize on the procedures and thought processes they use to make this 
decision. Some general comments follow. 

Usually a synchronous executive is written where measurements and 
monitoring tasks are cycled through at a specified rate chosen by the rate at 
which the central computer needs data and/or the rate at which critical load 
management or survivability actions need to be taken. Any necessary 
calculations and logical decisions, along with subsequent corrective actions, 
must be processed well within the synchronous executive's cycle. Therefore, 
in deciding whether one computer is sufficient, the following steps may be 
taken: 


(1) Translate the functional requirements of the system into 
precise specifications of required tasks. 

(2) Develop efficient algorithms to accomplish those tasks. 

(3) Examine the timing and memory requirements for the
algorithms; this task includes:

(a) Estimate lines of code necessary 

(b) Estimate time required for execution 

(c) Estimate memory requirements, including access time. 

(4) Establish a time line - or sequence of tasks - listing all
tasks in the order (and at the frequency) in which they must be
executed for a sample executive clock cycle. This includes
recording measurements, simultaneous checks for load faults,
interrupts to eliminate faults, control signals to relays
within critical times for effective problem resolution, etc.
(5) Examine latency restrictions (i.e., once an error is
detected, how long can one wait until a corrective action
must be taken) and look for ways to overlap tasks or condense
the time line.

(6) Add a "comfortable" time margin for software overhead
(executive control, computer communication software, etc.).
Some designers choose as much as 50 percent margin.

The result of this procedure is an estimate of the time required to
process the necessary tasks with the chosen computer. If all tasks cannot be
executed well within the synchronous executive's cycle, then a faster computer
or a multicomputer system should be considered.
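
The procedure above amounts to a timing budget, which can be sketched in a
few lines. The fragment below is illustrative only: the task names,
execution times, run counts, and the 100-ms executive cycle are hypothetical
values chosen for the example, not figures from this study, and the margin
is simply applied as a multiplier on the summed task time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    exec_time_ms: float   # estimated execution time of one run
    runs_per_cycle: int   # how often the task runs in one executive cycle


def cycle_load_ms(tasks, margin=0.5):
    """Time needed per executive cycle, inflated by a fractional margin."""
    busy = sum(t.exec_time_ms * t.runs_per_cycle for t in tasks)
    return busy * (1.0 + margin)


def fits_single_computer(tasks, cycle_ms, margin=0.5):
    """True if all tasks, plus margin, fit inside one executive cycle."""
    return cycle_load_ms(tasks, margin) <= cycle_ms


# Hypothetical task set for a 100-ms executive cycle.
tasks = [
    Task("fault detection", 2.0, 10),
    Task("data acquisition", 15.0, 1),
    Task("relay control", 1.0, 10),
    Task("load sequencing", 5.0, 1),
]

print("load with margin (ms):", cycle_load_ms(tasks))
print("fits in 100-ms cycle:", fits_single_computer(tasks, 100.0))
```

If the check fails, the choices are those named in the text: a faster
computer, more efficient algorithms, or a multicomputer system.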

The throughput requirements of a spacecraft power system can
generally be met by efficient use of a single computer. If more speed is
necessary, a faster computer or more computationally efficient algorithms may
be chosen. The decision to go to a multicomputer or distributed network is
usually made for other reasons. These are discussed in subsections 3.3
through 3.5.


3.3 RELIABILITY 

Computers for use in planetary spacecraft power systems will
perform functions which are computation-critical and which require long life.
The equipment cannot be maintained or repaired, yet reliable operation is
demanded for the duration of the mission (5 or more years). This
imposes the most stringent fault tolerance requirements in a real-time
environment to avoid jeopardizing the success of the mission. These stringent
requirements can be met by a reliable version of a single computer system. (A
fault tolerant uniprocessor, the self-testing and repairing (STAR) computer,
was developed at JPL (Ref. 5).) To achieve reliability, such a system usually
requires redundancy and, if graceful degradation is desired, a fine
partitioning of the computer system into programmable replacement modules.

Partitioning can occur at different levels. With many machines in
the past, partitioning was at the subprocessor level. With current
technology, it makes little sense to partition a system below the level of a
microcomputer. Thus, due to very large scale integration (VLSI) technology,
the partitioning concept has evolved into an architecture in which individual
computers make up the replaceable system modules. Such a distributed computer
network is well suited to applications like the power system where the
computing system controls a number of relatively autonomous (although possibly
functionally interdependent) subelements (i.e., inverter control and load
management).

Thus, even if a single computer can handle the throughput
requirements, reliability goals may require a distributed network,
particularly if the reliability goals include graceful degradation.
3.4 CONSIDERATIONS

Embedding small, dedicated processors into functionally
partitioned subsystems yields several advantages.


3.4.1 Ease of Development 

Subsystem designers, who are most familiar with their own equipment,
can develop the independent software necessary for its particular control
and/or fault diagnosis if a multicomputer approach is taken (i.e., software
design is modularized). Also, if local subsystems are independent, local
control frequently results in simpler higher level control and data handling
programs.


3.4.2 Survivability 

Graceful degradation is possible in distributed multicomputer
systems because the total system can be designed to continue to operate
despite individual computer failures.

3.4.3 Flexibility 

As future spacecraft systems change in size and complexity, system
redesign is simplified by incrementally deleting or adding microcomputers and
modifying software.

These benefits are, however, accompanied by some disadvantages. 

The designer is faced with increased software complexity. Distributed systems
typically require their own executives which must communicate with other
executives in the system (or other systems). This also means the distributed
system is more dependent on computer communication technology. In addition,
overall diagnostic software development is usually more difficult in
multicomputer systems.


3.5 BOTTOM LINE 

The choice of using a single computer or a multicomputer network is
a function of long-term design objectives. If the power system is to be
custom redesigned for each mission, then a practical engineering approach will
probably result in a single microcomputer with a standby unit to avoid a
single point failure. However, if a general power system design is desired -
one which is flexible and can be "programmed" to ease development efforts for
different missions - then a distributed multicomputer approach seems more
appropriate.
SECTION 4


POSSIBLE MULTICOMPUTER ARCHITECTURES 

Six basic multicomputer architecture types (or interconnect
technologies) are described in the literature:

(1) Shared memory 

(2) Shared bus 

(3) Loop systems 

(4) Star configurations 

(5) Hierarchical configurations 

(6) Point to point interconnections 

Each topology has certain attributes that affect its suitability
for power system applications. These attributes are related to cost,
reliability, performance (responsiveness, speed, throughput), ease of
development, modularity, reconfigurability and survivability, and such
physical parameters as volume, weight, and power consumption.

Some of the more common interconnect technologies (with a few
variations of the basic six) are briefly compared based on selected design
attributes in the discussion which follows. The architectures are discussed
in the order of decreasing reliability based on vulnerability to a single
component failure.



4.1 COMPLETELY INTERCONNECTED

The completely interconnected architecture is conceptually the
simplest design. Each processor is connected by a dedicated path to every
other processor. Communications software becomes extremely complicated as the
number of processors increases.

Cost: High - function of the number of micros in the system.

Modularity: Fair - number of ports on each micro is N-1.

Reliability: Most reliable - only local problem if micro fails.
Redundant paths alleviate single link failures.
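
The growth in interconnect cost can be made concrete. With N computers, each
needs N-1 ports (as noted above), and counting one dedicated path per pair
gives N(N-1)/2 links. The short sketch below tabulates both figures; the
function names are illustrative only.

```python
def full_mesh_ports(n):
    """Ports needed on each computer in a completely interconnected system."""
    return n - 1


def full_mesh_links(n):
    """Dedicated point-to-point paths needed for n computers (one per pair)."""
    return n * (n - 1) // 2


for n in (2, 4, 8, 16):
    print(n, "computers:", full_mesh_ports(n), "ports each,",
          full_mesh_links(n), "links in total")
```

The link count grows with the square of the number of computers, which is why
this topology is rated high in cost and only fair in modularity.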
4.2 PACKET SWITCHED NETWORK



Messages are broken into packets and transmitted by way of
available nodes. At least two paths exist between any two computers in the
system.


Cost: High - each node requires routing control. 

Modularity: Good. 

Reliability: Only local problem if a computer fails.


4.3 REGULAR NETWORK 



Every computer is connected to its own neighbor and another 
computer above and below it. The network gets complicated if there are very 
many computers. The "tree" is a hierarchically structured variation with any 
computer able to communicate with its superior and its subordinates as well as 
its two neighbors. 

Cost: High - function of number of computers in system. 

Modularity: Poor. 

Reliability: Only local problem if computer fails. Redundant paths
eliminate single connect failures.
4.4 IRREGULAR NETWORK


4.5 HIERARCHY

The hierarchy configuration is used in process control and data
acquisition applications. The capabilities are specialized at lower levels
and more general purpose at the top.

Cost: Medium - function of distance between computers.

Modularity: Good.

Reliability: System operability reduced with single point failure,
more serious the higher up the failure occurs.



4.6 LOOP

Loop architecture evolved from the data communication environment.
In this configuration, each computer is connected to two neighboring
computers. The data can flow in both directions, but circulating traffic in
one direction is less complicated.

Cost: Medium - main cost is adapters.

Modularity: Good - limited by addressing capability.

Reliability: System unaffected by single loop failures for a redundant
two-loop system - catastrophic for single, unidirectional loop.


4.7 GLOBAL BUS 



The use of a common or global bus requires some allocation scheme
for sending messages from one computer to another.

Cost: Medium - main cost is bus adapters.

Modularity: Good.

Reliability: Only local problem if a computer fails - catastrophic
with bus failure.


4.8 STAR



The star configuration has a central switching resource. Each
computer is connected to the central switch. Traffic is in both directions.

Cost: Medium to Low - major cost item is switch.

Modularity: Good - until switch saturates.

Reliability: Only local problem if a computer fails - catastrophic
if switch fails. Switch is possibly less reliable than bus or loop.
4.9 LOOP WITH SWITCH



This refinement of the loop provides a switching element that 
removes messages from the loop, maps their addresses, and replaces them on the 
loop properly addressed to their intended destination. 

Cost: Medium - main cost is switch.

Modularity: Good - Fair, until switch saturates.

Reliability: Catastrophic if either switch or loop fails.


4.10 BUS WINDOW



The bus window configuration has more than one switch. Messages 
may be transmitted on the path they are received or on another. The switches 
provide "windows" for passing messages between buses. 

Cost: Low - main cost is switch.

Modularity: Poor.

Reliability: Serious contention problems. Partial system failure
if switch or bus fails.


4.11 BUS WITH SWITCH 




This is more like the global bus, since each computer is connected 
to the central switch and traffic flows from the originating computer to the 
switch, and from the switch to the destination computer. The computers share 
the path (bus) to share access to the switch. 

Cost: Low - main cost item is the switch. 

Modularity: Good - Fair, until switch saturates. 

Reliability: Catastrophic if bus or switch fails.
4.12 SHARED MEMORY


The most common way to interconnect computer systems is to
communicate by leaving messages for one another in a commonly accessible
memory. The key characteristic is that the memory is used as a data path
as well as storage.

Cost: Low - main cost is multiported memory.

Modularity: Poor - limited to number of memory ports.

Reliability: Least reliable - catastrophic if memory fails.

In the design of multicomputer systems, the consideration of all 
possible interconnect technologies may not be necessary. Practical aspects of 
specific applications frequently lead to a limited choice of architectures. A 
methodology for making such decisions is discussed in the next section. 
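
The comparisons in this section can be collected into a small lookup table
and screened against a requirement. The sketch below is one possible
encoding, not part of the study: the three-level failure classification
("local", "partial", "catastrophic") is a paraphrase of the reliability
remarks above, and the loop entry refers to the single, unidirectional loop.

```python
from typing import NamedTuple


class Topology(NamedTuple):
    cost: str
    modularity: str
    worst_single_failure: str  # "local", "partial", or "catastrophic"


# Ratings paraphrased from the Cost / Modularity / Reliability entries above.
TOPOLOGIES = {
    "completely interconnected": Topology("high", "fair", "local"),
    "packet switched network": Topology("high", "good", "local"),
    "regular network": Topology("high", "poor", "local"),
    "hierarchy": Topology("medium", "good", "partial"),
    "loop (single, unidirectional)": Topology("medium", "good", "catastrophic"),
    "global bus": Topology("medium", "good", "catastrophic"),
    "star": Topology("medium to low", "good", "catastrophic"),
    "loop with switch": Topology("medium", "good to fair", "catastrophic"),
    "bus window": Topology("low", "poor", "partial"),
    "bus with switch": Topology("low", "good to fair", "catastrophic"),
    "shared memory": Topology("low", "poor", "catastrophic"),
}


def survivable(topologies):
    """Names of topologies in which no single failure is catastrophic."""
    return [name for name, t in topologies.items()
            if t.worst_single_failure != "catastrophic"]


print(survivable(TOPOLOGIES))
```

A screen of this kind only narrows the field; redundancy (for example, a
second loop or bus) can move a topology out of the catastrophic class at
added cost.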
SECTION 5


MULTICOMPUTER DESIGN METHODOLOGY

Since the design of distributed microcomputer systems is an art
dependent on experience, there exists no standardized formula to provide the
optimal configuration. Key attributes in a typical data acquisition and
control system are performance, reliability, availability, fault tolerance,
and failure reconfigurability. Other attributes, slightly less important,
are life-cycle cost and modularity/growth. System design becomes a trade-off
analysis weighing the relative contributions of alternate architectures to
maximize the important attributes of the system. Although a methodology for
an optimum universal design is virtually impossible, there are some general
statements which can be made concerning the choice of microcomputer
architecture and the subsequent implementation of fault tolerance.
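
One common way to organize such a trade-off analysis is a weighted scoring
matrix. The sketch below shows the mechanics only; the attribute weights and
the 1-to-5 scores are hypothetical placeholders, not results of this study,
and in practice they would come from the designer's judgment and from
performance and reliability models.

```python
# Hypothetical weights: larger means more important to the mission.
WEIGHTS = {
    "performance": 3,
    "reliability": 5,
    "fault_tolerance": 5,
    "life_cycle_cost": 2,
    "modularity": 2,
}

# Hypothetical scores on a 1 (poor) to 5 (excellent) scale.
CANDIDATES = {
    "single computer with standby": {
        "performance": 4, "reliability": 3, "fault_tolerance": 2,
        "life_cycle_cost": 5, "modularity": 2,
    },
    "hierarchical network": {
        "performance": 4, "reliability": 4, "fault_tolerance": 4,
        "life_cycle_cost": 3, "modularity": 4,
    },
}


def weighted_score(scores, weights):
    """Sum of weight * score over all attributes."""
    return sum(weights[attr] * scores[attr] for attr in weights)


def rank(candidates, weights):
    """Candidate names ordered from highest to lowest weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)


for name in rank(CANDIDATES, WEIGHTS):
    print(name, weighted_score(CANDIDATES[name], WEIGHTS))
```

The ranking is only as good as the weights and scores supplied, which is
consistent with the observation that the design remains dependent on
experience.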


5.1 CHOICE OF ARCHITECTURE 

The design of a distributed microcomputer system is primarily a 
function of the experience of the designer. It is usually approached in a 
sequential fashion with the following considerations. 


5.1.1 Problem Definition (or Process Identification) 

The functional requirements do not necessarily make clear the
consequences of the specific tasks required of the system. It is
necessary to determine as precisely as possible what is to be automated. This
should also include the number and type of measurements, the number and type
of controlling functions and signals, the relative criticality of each of the
above, and the timing requirements. It is also necessary to identify the
specific communication requirements in order to interface with other computers.


5.1.2 Problem Decomposition 

This is a functional breakdown of the system requirements. The 
value of identifying major functional groups is that the designer will develop 
an understanding of the major subtasks to be performed by the system with a 
qualitative feel for the workload imposed by each function. One should 
identify critical functions, which may demand ultra-reliability, and those 
which may be allowed to gracefully degrade. This usually leads to allocating 
separate processor memory resources to handle different functional groups. 

Some schemes have been developed to aid the designer with this
task. Weitzman (Ref. 6) uses a structured set of data-flow primitives which
are arranged in process architecture trees. This phase also embodies the
experience of the designer.


5.1.3 Process Interaction

One would like to obtain and formulate adequate quantitative 
knowledge of the information flow between functions. Analytical tools 
available to develop an understanding of the various interrelationships 
between subprocesses include state exchange diagrams, process interaction 
diagrams, and charts (Ref. 6). 


5.1.4 Performance Requirements

One must define as specifically as possible the system's physical
performance requirements. These include:

(1) Sizing of tasks

(2) Defining relationships between tasks

(3) System control for information movement and processing which 
involves identifying: 

(a) Information transfer strategy 

(b) Transfer control method 

(c) Transfer path structure 

(d) Shared and dedicated system resources. 


5.1.5 Choice of System Architecture 

In the selection of appropriate hardware and software elements and 
system structure, the consideration of all possible architectures may not be 
necessary. The pros and cons of information transfer strategy, control 
methods, and path structure may provide an indication of the most attractive 
solutions. Figure 5-1 indicates a general methodology for such choices. 

































SECTION 6 


HIERARCHICAL CONFIGURATION TECHNOLOGY 

Of all the interconnect technologies discussed, data acquisition
and process control systems have most frequently been based on hierarchical
architectures. This does not mean that loop or bus systems should be
automatically ruled out. However, some design considerations for hierarchical
structures should be mentioned.

A hierarchical configuration, as its name implies, consists of a
tree structure of computers. In general, the capability of the computer
increases as the top of the pyramid is reached. This is often due to
practical rather than theoretical reasons. In a manner similar to a corporate
organizational structure, the capabilities at the base are generally
applications-dependent, with a special-purpose capability dedicated to
performing well-defined, specialized tasks, whereas the top of the
organization has a more general-purpose capability, controlling and
coordinating the entire system. In such a system, computer functions are
usually distributed. The tedious repetitive functions and algorithms, such as
data collection and reduction, are handled at the lowest levels, whereas data
processing and command execution (control) are performed at the top.

Typically, shared data bases are also stored at the top rather than 
distributed throughout the system. 

The partitioning of overall system processing loads into 
approximately equal-size processing segments can make it possible to use one 
type and size computer in the system (at least in the lowest level). This has 
an advantage in that, since all computers are identical, the system may be 
implemented in such a way that a standby unit is always available and can be 
switched online to perform the tasks of any other computer in the system 
(should one become inoperative). Thus reliability can be improved without 
complete redundancy. 

From a reliability standpoint, if a failure occurs in the computer
located at the top of the pyramid, total system control is lost. This
requires redundancy along with doubling of all communications paths at the
top. An example of such a structure is shown in Figure 6-1. Thus, the
addition of redundancy greatly increases the complexity of the system as well
as software overhead. (Doubling of hardware does not necessarily double the
reliability of the system. See the reliability discussion in the Appendix.)
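
The parenthetical remark can be illustrated numerically. The sketch below
assumes a constant failure rate (exponential reliability model), independent
failures, and perfect switching to the redundant unit; none of these
assumptions is taken from this report, and the failure rate is a hypothetical
value. Only the 5-year mission duration comes from the text.

```python
import math


def reliability(failure_rate_per_hour, hours):
    """Probability that one unit survives the mission (exponential model)."""
    return math.exp(-failure_rate_per_hour * hours)


def parallel(unit_reliability, n=2):
    """Reliability of n redundant units when any one surviving is enough."""
    return 1.0 - (1.0 - unit_reliability) ** n


MISSION_HOURS = 5 * 365 * 24   # 5-year mission
FAILURE_RATE = 1.0e-5          # hypothetical failures per hour

single = reliability(FAILURE_RATE, MISSION_HOURS)
duplex = parallel(single, 2)

print("single computer:", round(single, 3))
print("duplicated computer:", round(duplex, 3))
```

Under these assumptions duplication raises the reliability noticeably but by
much less than a factor of two, and the added switching hardware and software
(ignored here) would reduce the gain further.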

Whether redundancy is used or not, hierarchical microcomputer 
systems should be designed to be capable of operating in a degraded mode. The 
loss of a single computer should result in the absolute minimum amount of 
information being lost and should not cause the entire system to cease 
functioning. To ensure operation in a degraded mode, the following design 
features should be incorporated: 

(1) When a low level computer fails, all of its process outputs
should be frozen and transfers should automatically be made
to backup control by reconfiguring the system (e.g., by
switching to a spare).

(2) Each computer should be able to store, for a reasonable
period of time, information destined for another computer.
This information could be transmitted when the target
computer becomes operational again.

(3) No computer should depend solely on information arriving from
another computer. Crucial programs should always exist at
the site where they are needed. Mathematical results or
measurements should be replacements for old results
(calibrations for example), and old results should continue
to be used until new ones become available.


Figure 6-1. System Reliability Increased With Redundant Control
and Spare Local Computers
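
Design features (2) and (3) can be sketched as a small data structure held by
each local computer. The class below is illustrative only; the names
`LocalNode`, `outbox`, and `last_good`, the buffer size, and the use of a
plain list to stand in for the communication link are all assumptions made
for the example.

```python
from collections import deque


class LocalNode:
    """Keeps last-known-good results and buffers messages for offline peers."""

    def __init__(self, max_queued=100):
        self.last_good = {}                      # most recent value per parameter
        # Bounded buffer: when full, the oldest message is dropped, which
        # models storage "for a reasonable period of time".
        self.outbox = deque(maxlen=max_queued)

    def update(self, name, value):
        """Replace an old result only when a new one actually arrives."""
        if value is not None:
            self.last_good[name] = value

    def read(self, name):
        """Return the newest available result, even if it is an old one."""
        return self.last_good[name]

    def send(self, message, peer_online, link):
        """Deliver now if the peer is up; otherwise hold the message."""
        if peer_online:
            link.append(message)
        else:
            self.outbox.append(message)

    def flush(self, link):
        """Forward everything held while the peer was down, oldest first."""
        while self.outbox:
            link.append(self.outbox.popleft())


node = LocalNode()
node.update("calibration", 1.02)
node.update("calibration", None)      # no new result arrived this cycle
link = []                             # stands in for the path to the peer
node.send("telemetry frame 1", False, link)
node.send("telemetry frame 2", False, link)
node.flush(link)                      # peer is operational again
print(node.read("calibration"), link)
```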
IMPLEMENTING FAULT TOLERANCE

Recent literature, based on research, analysis, and experience
accumulated over the past decade, indicates that definite guidelines exist for
the implementation of tolerance of physical faults in digital systems (Ref. 7).
These are summarized as follows:

(1) Devise a fully satisfactory system according to given
performance specifications, assuming fault-free conditions.

(2) Specify reliability goals for the system. 

(a) Explicitly identify classes of faults that are to be
tolerated. (This usually limits the faults to less
than all possible things that can go wrong.)

(b) Specify quantitative reliability goals for each fault 
set. 

(c) Postulate a method to evaluate actual reliability. 

(3) Select and incorporate fault-detection algorithms. This
usually leads to the addition of new elements or software
accomplishing parity checks, self-test programs, etc.

(4) Devise recovery algorithms which are invoked by signals from
fault-detection algorithms and whose goal is to return the
system to some level of normal operation, or to shut part of
it down safely. Recovery consists of all actions that take
place after the fault is detected. These may include:

(a) Error correction 

(b) Fault location 

(c) Exclusion or replacement of failed parts

(d) Recording of actions taken 

(e) Restart of normal operation. 

This may involve addition of spare computers or bus elements,
or increase in memory size, etc.

A special form of recovery results from the use of
fault-masking techniques in which redundant elements
instantly conceal the effect of faults without a separate
fault detection being required.

(5) Evaluation is performed by means of modeling and/or
simulation. Reliability prediction is compared to that of
the original system. Degradation of performance is noted for
each fault set.
(6) Refinement of design is performed. Initial evaluation is
likely to demonstrate that various subsystems display unequal
reliability contributions to the total system reliability.

Hardware implementation of fault tolerance to physical faults has
led to several system design concepts. Triple modular redundancy (TMR),
standby, hybrid, self-purging, and duplex redundancy techniques are some of the
schemes discussed in the literature (Refs. 8-12).
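
Of the schemes listed, triple modular redundancy is the simplest to
illustrate: three modules compute the same output and a voter passes on the
majority value, masking a fault in any one module. The sketch below shows
only the voting step in software; the function name is illustrative, and a
flight implementation would normally vote in hardware and would also have to
consider failure of the voter itself.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three module outputs.

    Raises ValueError if all three outputs disagree, since no single
    fault assumption can then be masked.
    """
    value, count = Counter((a, b, c)).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three modules disagree")
    return value


print(tmr_vote(1, 1, 0))   # one faulty module is outvoted
print(tmr_vote(7, 3, 7))
```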

Physical faults are not the only events that disrupt the specified
behavior of digital systems. Many events can be traced back to some
imperfections in the software that had remained unidentified. At least two
approaches to software fault-tolerance design have appeared in the literature.
They are the recovery block (Ref. 13) and N-version programming (Ref. 14). Both
methods use some redundancies analogous to successful fault-tolerance
approaches to physical faults.
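
The recovery block idea can be sketched briefly: a primary routine and one or
more alternates are tried in turn, each starting from the same saved state,
and the first result that passes an acceptance test is used. The code below
is a minimal illustration under that general description; the routine names,
the bus-voltage example, and the acceptance limits are hypothetical and are
not taken from Ref. 13.

```python
def recovery_block(acceptance_test, alternates, state):
    """Try each alternate on a fresh copy of the state; return the first
    result that passes the acceptance test."""
    for alternate in alternates:
        checkpoint = dict(state)      # a failed try cannot corrupt the original
        try:
            result = alternate(checkpoint)
        except Exception:
            continue                  # treat a crash as a failed alternate
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def primary_average(state):
    # Hypothetical primary routine with a latent bug: it trusts a stored
    # count that may be zero.
    return sum(state["samples"]) / state["count"]


def backup_average(state):
    # Simpler alternate that recomputes the count itself.
    return sum(state["samples"]) / len(state["samples"])


def plausible_bus_voltage(value):
    # Hypothetical acceptance test: value must lie in a plausible range.
    return 0.0 < value < 40.0


state = {"samples": [28.1, 27.9, 28.0], "count": 0}
voltage = recovery_block(plausible_bus_voltage,
                         [primary_average, backup_average], state)
print(voltage)
```

N-version programming differs in that the independently written versions all
run and their results are compared, much as in the hardware voting schemes.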
CONCLUSIONS/RECOMMENDATIONS

The choice between single-computer and multicomputer architectures
is determined primarily by the system considerations of performance,
reliability, and flexibility. For most power systems, a single microcomputer
will handle throughput requirements. Multicomputer configurations may be 
chosen for the additional considerations of flexibility and ease of 
modification. In addition, distributed control of a multicomputer system may 
provide the benefits of graceful degradation and considerable fault 
tolerance. Hierarchical configurations are most frequently used in similar 
applications and appear to be an adequate compromise between maximizing fault 
tolerance and flexibility. Although not the only scheme possible, these 
systems can be made reliable with redundancies and/or spares, yet permit 
modular design for ease of development and modification. 

A general design methodology is presented for both single computer
and multicomputer systems. For either approach, a combined hardware/software
fault tolerant design has the most advantages. Hardware redundancies increase
the reliability of the physical systems, but extra software effort can provide
more than a computer system with built-in spares: it can provide continued
computational and control capability.

It is recommended that the power system be standardized, including
bus characteristics, power processing equipment, data bus interfaces, battery
cells, etc. The benefits of a distributed multicomputer system can be gained
by implementing a reusable power system design, incorporating sufficient
flexibility for expansion to a wide range of missions. In such a system,
modification of control functions or system reconfiguration can become a 
matter of software manipulation rather than major hardware change. This can 
also permit development of analytical methods to model the system's 
performance and reliability. These tools, in the form of computer programs, 
can then be used to optimally reconfigure the system to suit new mission 
requirements. 
APPENDIX


APPLICABLE NOTES ON RELIABILITY 
A-1 Definition of Reliability 

The definition of reliability commonly accepted for engineering
applications is the characteristic of a component or system, expressed by a
probability, that it will perform a required function under stated conditions
for a specific period of time. Models are usually developed to calculate and
compare reliability of alternate systems. Since multicomputer systems
frequently are required to carry out more than one type of function (e.g.,
load management and battery conditioning), separate reliability models for
each of these functions may be necessary to make the problem more tractable.

Several parameters may have a marked effect on the reliability of a 
given system. These include environmental conditions (temperature, 
humidity, vibration, etc.) and operating conditions (voltage, current, power 
dissipation). 

When comparing alternate systems, the relative system reliability 
can be measured both quantitatively and qualitatively. 

Quantitative measurements: 

o Mean time between failures (MTBF) 

Usually specified in hours, this can be related to component 
reliability and type of redundancy. 

o Mean time to repair (MTTR) 

Also specified in hours, this can be minimized with built-in 
redundancy, real-time self-check, and diagnostics. 

o Failure reconfiguration time 

A reliable system requires redundant paths and/or 
microprocessors that can be activated as soon as a failure is 
detected. The time to reconfigure may be critical to avoid 
system failure. 

Qualitative measurements: 

o Graceful degradation 

This is application related. One cash register failing is 
not a great loss (except on Friday nights), but one part of a 
measurement system in a spacecraft may be. Computers must be 
connected in such a way as to minimize the effect of a failure 
on the total system. 

o Fault tolerance 

This attribute allows a system to function when some 
component fails. The level or depth of fault tolerance 
depends on the fault set and recovery procedures. 
A-2 Availability 


Availability is a term frequently used when discussing reliability, 
since it is also measured by the same parameters. Availability is defined as 
the percentage of time a microcomputer system is up (available). It may be 
expressed quantitatively as follows: 

$$ A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} $$

From this equation, it can be seen that availability can be 
improved by increasing the MTBF and/or decreasing the MTTR. 
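
As a rough illustration of the availability relation (MTBF divided by the sum of MTBF and MTTR), the following Python sketch evaluates it for a few cases. The function name and the numerical values (a 10,000-hour MTBF and several MTTR values) are assumptions chosen for illustration; they are not taken from this study.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability A = MTBF / (MTBF + MTTR) as a fraction of time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical MTBF, hours
    # Hypothetical repair times, hours; 0.0 models an ideal automatic switchover.
    for mttr in (100.0, 10.0, 1.0, 0.0):
        print(f"MTTR = {mttr:6.1f} h -> A = {availability(mtbf, mttr):.6f}")
```

Holding MTBF fixed, the computed availability rises toward 1 as MTTR shrinks, which is the effect that built-in spares with automatic reconfiguration are intended to achieve.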

The lowest MTTR can be achieved by having spare units wired 
into the system as either hot or cold standbys. This, combined with automatic 
fault-detection devices and an automatic reconfiguration capability that 
switches failed units out and backup units in, reduces MTTR to virtually zero. 


Such fault tolerant design requires additional critical components, 
however, that are in turn subject to failure. 


A-3. Reliability of Interconnected Components 

The failure pattern of equipment placed in service can be 
categorized into three periods of operation, as illustrated in Figure A-1. 

At the very beginning, any inherently weak parts that are the 
result of improper design, improper manufacture, or improper use usually fail 
fairly soon. The early failure rate decreases progressively and eventually 
levels off as the weak components are replaced (usually during tests under 
accelerated conditions). Spacecraft systems, which are non-repairable during 
missions, should be operated for a period of time under varying conditions to 
ensure detection of early failures. After the early-failing parts have been 
replaced, the components settle down to a long, relatively steady period at an 
approximately constant failure rate. The normal working life of a system 
occurs during this interval. In the wear-out period, the components rapidly 
deteriorate, and each component eventually wears out. 

Figure A-1. Typical Bathtub Curve of Failure Rate Versus Time 
(vertical axis: failure rate; horizontal axis: time) 

A reliability calculation may be made rather simply during the 
constant failure rate portion of the curve. The constant failure rate implies 
that the probability of failure is independent of age. A reliability function 
so characterized is the negative exponential distribution 

$$ R = e^{-\lambda t} $$

where 

$\lambda$ = failure rate, $t$ = time. 

It is assumed that at $t = 0$, all components are operational. 
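
The statement that the probability of failure is independent of age can be checked numerically. The sketch below evaluates the exponential reliability function, exp(-failure rate x time), and the probability of surviving a further interval given survival to some age. The function names, the failure rate, and the interval length are assumptions made for illustration only.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Reliability R(t) = exp(-failure_rate * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)


def conditional_reliability(failure_rate: float, age: float, t: float) -> float:
    """Probability of surviving a further time t, given survival up to `age`."""
    return reliability(failure_rate, age + t) / reliability(failure_rate, age)


if __name__ == "__main__":
    lam = 1.0e-5       # hypothetical failure rate, failures per hour
    interval = 8760.0  # hypothetical interval of one year, hours
    print(f"R(interval) for a new unit: {reliability(lam, interval):.4f}")
    for age in (0.0, 10_000.0, 50_000.0):
        r = conditional_reliability(lam, age, interval)
        print(f"age {age:8.0f} h: further-survival probability {r:.4f}")
```

For every age the conditional value equals the reliability of a new unit over the same interval, which is the sense in which a constant failure rate makes failure independent of age. This holds only within the constant-failure-rate portion of the curve.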

A physical system consists of many different types of components, 
each of which has a different instantaneous failure rate. The ultimate 
concern of the designer is the reliability of the total system. 

Logically, the components are connected in either series or 
parallel (as with redundancies). The reliability of such interconnections of 
components (whose individual reliability functions are exponential) may be 
derived and is summarized in Table A-1. 

It can be seen that the system reliability increases with the 
number of parallel paths and that it decreases with the number of units in 
series. 
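
The trend described above can be illustrated with a short Python sketch of the standard series and parallel reliability expressions for independent elements (the forms summarized in Table A-1). The function names and the element reliability of 0.9 are assumptions for illustration, not values from this study.

```python
import math
from typing import Iterable


def series(reliabilities: Iterable[float]) -> float:
    """n series elements: the system works only if every element works."""
    return math.prod(reliabilities)


def parallel(reliabilities: Iterable[float]) -> float:
    """m parallel elements: the system fails only if every element fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def parallel_of_series(r: float, n: int, m: int) -> float:
    """m parallel paths, each consisting of n identical series elements."""
    return 1.0 - (1.0 - r ** n) ** m


if __name__ == "__main__":
    r = 0.9  # hypothetical reliability of one element at a fixed time
    for k in (1, 2, 3, 4):
        print(f"{k} unit(s): series = {series([r] * k):.4f}, "
              f"parallel = {parallel([r] * k):.4f}")
    print(f"2 paths of 2 series elements: {parallel_of_series(r, 2, 2):.4f}")
```

With an element reliability of 0.9, two elements in series give 0.81 while two in parallel give 0.99, so the series value falls and the parallel value rises as units are added. The sketch assumes independent failures, which is also the assumption behind the product forms.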

A-4. Effect of Redundancy on MTBF 

The overall reliability of a system may be improved by adding 
redundancy so that, if one unit fails, another is available to perform the 
necessary functions. There are active and standby types of redundancy. The 
parallel configuration discussed previously is an active type in which the 
redundant elements are continuously energized and used to perform the required 
circuit or system functions. In standby redundancy, the additional units are 
activated only when needed. The advantages of active and standby redundancy 
can be expressed in terms of the mean time between failures (MTBF). MTBF is a 
quantitative measure of reliability which may be expressed as the integral of 
the reliability function. 

$$ \mathrm{MTBF} = \int_{0}^{\infty} R(t)\,dt $$
Table A-1. Reliability of Series and Parallel Connections of Elements 
with Exponential Reliability Distributions (Ref. 6) 

Connection                               Reliability 
---------------------------------------  ------------------------------------- 
n series elements                        $R = \prod_{i=1}^{n} r_i$ 
m parallel elements                      $R = 1 - \prod_{i=1}^{m} (1 - r_i)$ 
m parallel paths with n series elements  $R = 1 - (1 - r^{n})^{m}$ 

Here $r_i$ is the reliability of element $i$, and $r$ is the common 
reliability of identical elements. 



If the reliability functions of redundant computers have the same 
exponential form (i.e., identical failure rates), then the combined redundant 
system will have the MTBF given in Table A-2. 

For the standby case, the spare unit remains unused until placed in 
service. The active redundant spare is used continuously and wears out along 
with the original. Thus, the MTBF of the standby configuration is twice the 
value of the one-unit configuration, whereas the active redundant pair (as 
calculated from the reliability of a parallel connection) has an MTBF of only 
1.5 times the one-unit value. 
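
The MTBF comparison can be reproduced numerically by integrating the reliability function of each configuration. The sketch below is a minimal illustration under stated assumptions: the failure rate is normalized to 1, the standby pair is modeled with an ideal switch and a spare that cannot fail while dormant (giving a reliability of exp(-λt)(1 + λt)), and a simple trapezoidal rule over a finite horizon stands in for the integral from zero to infinity. All names are hypothetical.

```python
import math
from typing import Callable

LAM = 1.0  # hypothetical failure rate, normalized so that 1/LAM = 1 time unit


def r_single(t: float) -> float:
    """One unit, no redundancy."""
    return math.exp(-LAM * t)


def r_active(t: float) -> float:
    """Two identical units in active (parallel) redundancy."""
    return 1.0 - (1.0 - math.exp(-LAM * t)) ** 2


def r_standby(t: float) -> float:
    """Two identical units, standby redundancy (ideal switch, dormant spare)."""
    return math.exp(-LAM * t) * (1.0 + LAM * t)


def mtbf(reliability: Callable[[float], float],
         horizon: float = 50.0, steps: int = 200_000) -> float:
    """Approximate the integral of R(t) from 0 to `horizon` (trapezoidal rule)."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt


if __name__ == "__main__":
    for name, func in (("one unit", r_single),
                       ("two units, active", r_active),
                       ("two units, standby", r_standby)):
        print(f"{name:20s} MTBF = {mtbf(func):.4f} (in units of 1/lambda)")
```

Under these assumptions the three integrals come out close to 1, 1.5, and 2 in units of 1/λ, consistent with the standby pair doubling the one-unit MTBF and the active pair giving a smaller improvement.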

A-5. Fault Tolerance 

Fault tolerance is the attribute of a digital system which makes it 
possible for a logic machine to continue with its specified tasks after the 
physical system suffers failures of its components. The implementation of 
fault tolerance is an approach to system design whose purpose is to increase 
reliability (or the probability that the system will function as designed). 

Fault tolerance is the survival attribute of a logic machine 
because its purpose is to cause a return from error states back to a specified 
behavior, thus assuring the survival of the information processing system 
(Ref. 7). 


The presence of fault tolerant features does not add any 
performance advantages during normal (fault-free) operation. On the contrary, 
fault-tolerance usually requires additional hardware and/or software that is 
redundant during normal operation and would be superfluous in a fault-free 
system. 
Table A-2. MTBF of Redundant Computer Configurations 

Configuration                  MTBF 
-----------------------------  ---------------- 
One unit, no redundancy        $1/\lambda$ 
Two units, active redundancy   $3/(2\lambda)$ 
Two units, standby redundancy  $2/\lambda$ 


To increase reliability, the only alternative to fault-tolerance is 
fault-avoidance, which requires the physical components and their assembly 
techniques to be perfect. As outlined previously, because of the constant 
rate of random failures ($\lambda$), the reliability of a system without 
redundancy is $R = e^{-\lambda t}$. The only way to increase the reliability 
of such a system is to force $\lambda$ as close to zero as possible. 
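
To give a feel for how demanding pure fault-avoidance is, the exponential reliability relation can be inverted to find the largest failure rate that still meets a reliability target over a given period: λ = -ln(R)/t. The Python sketch below does this. The function name, the five-year period, and the reliability targets are assumptions chosen for illustration only.

```python
import math


def required_failure_rate(target_reliability: float, hours: float) -> float:
    """Largest constant failure rate (per hour) with exp(-rate * hours) >= target."""
    if not 0.0 < target_reliability < 1.0:
        raise ValueError("target reliability must lie strictly between 0 and 1")
    if hours <= 0:
        raise ValueError("time period must be positive")
    return -math.log(target_reliability) / hours


if __name__ == "__main__":
    period = 5 * 8760.0  # hypothetical five-year period, hours
    for target in (0.9, 0.99, 0.999):  # hypothetical reliability targets
        lam = required_failure_rate(target, period)
        print(f"R >= {target}: failure rate <= {lam:.3e} per hour "
              f"(MTBF >= {1.0 / lam:,.0f} h)")
```

Each additional "nine" in the target lowers the allowable failure rate by roughly a factor of ten for a system without redundancy.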

The exclusive use of fault-avoidance has two serious drawbacks: 

o The cost of obtaining nearly perfect components rises very 
rapidly after failure rates have been reduced to threshold 
values that are characteristic of the physical parameters and 
manufacturing technology of the components. 

o Since the system will cease proper operation upon the first 
failure or malfunction, manual maintenance is necessary. 

Many systems have combined fault-avoidance and manual maintenance 
as a method to assure reliability (Ref. 7). This is not practical in space 
vehicles. There are strong reasons for the use of fault tolerance in 
spacecraft computer design: 

o Initial investment in fault-tolerance can reduce the lifetime 
cost of the system. 

o Space vehicles are placed in environments that do not allow 
access for manual maintenance. 

Fault tolerant design involves implementing hardware and/or 
software redundancies, fault detection, and reconfiguration strategies. 

Typical steps in a recovery strategy (Ref. 8) are: 

o Initial fault diagnosis 

o Identification of faulty module 

o Determine reconfiguration strategy 

o Perform reconfiguration 

o Condition new elements 

o Recover elapsed time 

o Roll back application programs 

The methodology of implementing fault-tolerance is discussed in 
Section 7. 